Generating high-performance arithmetic operators for FPGAs

Authors

  • Florent de Dinechin
  • Cristian Klein
  • Bogdan Pasca
Abstract

This article addresses the development of complex, heavily parameterized and flexible operators to be used in FPGA-based floating-point accelerators. Languages such as VHDL or Verilog are not ideally suited for this task. The main problem is the automation of tasks such as parameter-directed or target-directed architectural optimization, pipeline optimization, and generation of relevant test benches. This article introduces FloPoCo, an open, object-oriented software framework designed to address these issues. Written in C++, it inputs operator specifications, a target FPGA and an objective frequency, and outputs synthesisable VHDL fine-tuned for this FPGA at this frequency. Its design choices are discussed and validated on various operators.

1 Arithmetic operator design

1.1 Floating-point and FPGAs

FPGA-based coprocessors are available from a variety of vendors, and it is natural to try and use them for accelerating floating-point (FP) applications. On floating-point matrix multiplication, their floating-point performance slightly surpasses that of a contemporary processor [6], using tens of operators on the FPGA to compensate for their much slower frequency (almost one order of magnitude). However, FPGAs are no match for GPUs here. For other FP operations that are performed in software in a processor (for instance all the elementary functions such as exp, log, trigonometric functions, ...) there is much more speedup potential: one may design a dedicated pipelined architecture on an FPGA that outperforms the corresponding processor code by one order of magnitude while consuming a fraction of the FPGA resources [4]. Implementing the same architecture in a processor would be wasted silicon, since even the logarithm is a relatively rare function in typical processor workloads.

∗This work was partly supported by the XtremeData university programme, the ANR EVAFlo project and the Egide Brâncuşi programme 14914RL.
For the same reason, GPUs have hardware acceleration for a limited set of functions, and in single precision only. In an FPGA, you pay the price of this architecture only if your application needs it. Besides, operators can also be specialized in FPGAs. For example, a squarer theoretically requires half the logic of a multiplier; a floating-point multiplication by the constant 2.0 boils down to adding one to the exponent (a 12-bit addition in double precision), and shouldn't use a full-blown FP multiplier as it does in a processor. Actually, it is possible to build an optimized architecture for any multiplication by a constant [2]. Finally, operators can be fused on an FPGA: for example, the Euclidean norm √(x² + y²) can be implemented more efficiently than by linking two squarers, one adder and one square root operator. There are many more opportunities for floating-point on FPGAs [3]. The object of the FloPoCo project is to study and develop such FPGA-specific Floating-Point Cores.

1.2 From libraries to generators

FloPoCo is not a library but a generator of operators. Indeed, it is the successor to FPLibrary, a library of operators written in VHDL. Many parts of FPLibrary were actually generated by as many ad-hoc programs, and FloPoCo started as an attempt to bring all these programs into a unified framework. A first reason is that it is not possible, for instance, to write by hand, directly in VHDL or Verilog, an optimized multiplier by a constant for each of an infinite number of constants. However, this task is easy to automate in a program that inputs the constant. Another reason is the need for flexibility. Whether the best operator is a slow and small one or a faster but larger one depends on the context. FPGAs also allow flexibility in precision: arithmetic cores are parameterized by the bit-widths of their inputs and outputs.
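One classical way such a constant-fed program can exploit its input is canonical signed-digit (CSD) recoding, which turns multiplication by a constant into a short list of shifts and additions/subtractions. The sketch below illustrates only this general idea, not the actual (more sophisticated) algorithm of [2]:

```cpp
#include <cassert>
#include <vector>

// Recode a positive constant into canonical signed-digit form:
// digits in {-1, 0, 1}, least significant first, with no two
// adjacent nonzero digits. Each nonzero digit costs one
// adder/subtracter in a shift-and-add multiplier.
std::vector<int> csd(long c) {
    std::vector<int> digits;
    while (c != 0) {
        if (c & 1) {
            int d = 2 - static_cast<int>(c & 3); // +1 if c mod 4 == 1, -1 if c mod 4 == 3
            digits.push_back(d);
            c -= d;
        } else {
            digits.push_back(0);
        }
        c >>= 1;
    }
    return digits;
}
```

For instance, csd(7) yields digits for 8 − 1, so multiplying by 7 costs one subtraction ((x << 3) − x) instead of the two additions a plain binary decomposition would need.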
Flexibility also makes it possible to optimize for different hardware targets, with different LUT structures, memory and DSP features, etc. Thus, the more flexible an operator, the more future-proof it is. Finally, for complex operators such as elementary function evaluators, the optimal design is the result of a costly design-space exploration, which is best performed by a computer. VHDL and Verilog are good for describing a library of operators optimized for a given context, but the more flexibility and design-space exploration one wants, the more difficult this gets. It is natural to write operator generators instead. A generator inputs user specifications, performs any relevant architectural exploration and construction (sometimes down to pre-placement), and outputs the architecture in a synthesizable format. To our knowledge, this approach was pioneered by Xilinx with their core generator tool.² An architecture generator needs a back-end to actually implement the resulting circuit. The most elegant solution is to write an operator generator as an overlay on a software-based HDL such as SystemC, JBits, HandelC or JHDL (among many others). The advantages are a preexisting abstraction of a circuit and simple integration with a one-step compilation process. The drawback is that most of these languages remain relatively little-known and restricted in the FPGAs they support. Basing FloPoCo on a vendor generator would be an option, but would mean restricting it to one FPGA family. FloPoCo therefore took a less elegant, but more universal route. The generator is written in a mainstream programming language (we chose C++), and it outputs operators in a mainstream HDL (we chose standard synthesisable VHDL). Thus, the FloPoCo generator is portable, and the generated operators can be integrated into most projects, simulated using mainstream simulators, and synthesized for any FPGA using the vendor back-end tools.

¹www.ens-lyon.fr/LIP/Arenaire/Ware/FloPoCo/
Section 2.2 will show how they can nevertheless be optimized for a given FPGA target. The drawback of this approach is that we had to develop a framework instead of reusing one. Section 2 describes this framework and the way it evolved in a very practical and bottom-up way.

1.3 The arithmetic context

It is important to understand that this framework was developed only with arithmetic operators in view. An arithmetic operator is the implementation of a mathematical function, and this underlying mathematical nature is exploited pervasively in FloPoCo. For instance, an operator may be combinational or pipelined, but will usually involve no feedback loop or state machine (the only current exception is an accumulator). With this restriction, we are able to implement a simple, efficient and automatic approach to pipelining (see Section 3) and test-bench generation (see Section 4). As another example, when generating test benches, relevant test patterns may be defined by function analysis, and the expected output is defined as a mathematical function of the input, composed with a well-defined rounding function [1]. These are only a few examples. The design-space exploration for complex operators is based on automated error analysis [4], which is also specific to the arithmetic context. FloPoCo is not only a generator framework; it is also a generator of arithmetic cores using this framework. It currently offers about 20 operators, from simple ones such as shifters or integer adders to very complex ones such as floating-point exp and log. This article is not about these operators, but will be illustrated by actual examples of already implemented operators. FloPoCo is distributed under the LGPL, and interested readers are welcome to try it, use it and improve it.

²We would welcome any feedback on early architecture generators.
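The "mathematical function composed with rounding" view of test benches can be made concrete with a small sketch (an illustration of the principle, not FloPoCo's actual test framework): the reference output of a correctly rounded binary32 multiplier is the exact product rounded to binary32.

```cpp
#include <cassert>

// Reference output for a binary32 multiplier test vector: the exact
// mathematical product, rounded once to the target format. The product
// of two binary32 significands has at most 48 bits, so binary64 holds
// it exactly, and the final cast performs the single correct rounding.
float expected_product(float a, float b) {
    return static_cast<float>(static_cast<double>(a) * static_cast<double>(b));
}
```

A test-bench generator then only has to emit interesting input patterns (special values, boundary exponents, rounding-critical significands) and pair each with this reference value.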
2 The FloPoCo framework

The FloPoCo generator inputs (currently on the command line) a list of operator specifications, internally builds a list of Operator objects (some of which may be sub-components of the specified operators), then outputs the corresponding VHDL.




Journal:

Volume   Issue

Pages  -

Publication date: 2009